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Abstract —We consider the problem of joint estimation of 
structured covariance matrices. Assuming the structure is un¬ 
known, estimation is achieved using heterogeneous training sets. 
Namely, given groups of measurements coming from centered 
populations with different covariances, our aim is to determine 
the mutual structure of these covariance matrices and estimate 
them. Supposing that the covariances span a low dimensional 
affine subspace in the space of symmetric matrices, we develop a 
new efficient algorithm discovering the structure and using it to 
improve the estimation. Our technique is based on the application 
of principal component analysis in the matrix space. We also 
derive an upper performance bound of the proposed algorithm 
in the Gaussian scenario and compare it with the Cramer-Rao 
lower bound. Numerical simulations are presented to illustrate 
the performance benefits of the proposed method. 

Index Terms —Structured covariance estimation, joint covari¬ 
ance estimation. 

I. Introduction 

Large scale covariance matrix estimation using a small 
number of measurements is a fundamental problem in modern 
multivariate statistics. In such environments the Sample Co- 
variance Matrix (SCM) may provide a poor estimate of the 
true covariance. The scarce amount of samples is especially 
a problem in financial data analysis where few stationary 
monthly observations of numerous stock indexes are used to 
estimate the joint covariance matrix of the stock returns [1, 2], 
bioinformatics where clustering of genes is obtained based on 
gene sequences sampled from a small population [3], com¬ 
putational immunology where correlations among mutations 
in viral strains are estimated from sampled viral sequences 
and used as a basis of novel vaccine design [4, 5], psychology 
where the covariance matrix of multiple psychological traits is 
estimated from data collected on a group of tested individuals 
[6], or electrical engineering at large where signal samples 
extracted from a possibly short time window are used to 
retrieve parameters of the signal [7]. 

The most common approach to work around the sample 
deficiency problem is to introduce prior information, which 
reduces the number of degrees of freedom in the model and 
allows accurate estimation with few samples. Prior knowledge 
on the structure can originate from the physics of the underly¬ 
ing phenomena or from similar datasets, see e.g. [8-10]. The 
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most widely used approaches here are shrinkage towards a 
known matrix [11, 12], or assuming the true covariance matrix 
to be structured [13-21]. We focus on prior knowledge in 
the form of structural constraints, which are most commonly 
affine. Probably the most popular of them is the Toeplitz model 
[8, 13-15, 21-24], closely related to it are circulant matrices 
[16, 17]. In other settings the number of parameters can be 
reduced by assuming that the covariance matrix is sparse 
[18-20], A popular sparse model is the banded covariance, 
which is associated with time-varying moving average models 
[15, 21]. Other examples of structured covariances include 
factor models [25], permutation invariant models [26], and 
many others. 

An important common feature of the works listed in the 
previous paragraph is that they consider a single and static 
environment where the structure of the true parameter matrix, 
or at least the class of structures, as in the sparse case, is 
known in advance. Often, this is not the case and techniques 
are needed to learn the structure from the observations. A 
typical approach is to consider multiple datasets sharing a 
similar structure but non homogeneous environments [27-29]. 
This is, for example, the case in covariance estimation for 
classification across multiple classes [30], A related problem 
addresses tracking a time varying covariance throughout a 
stream of data [15, 31], where it is assumed that the structure 
changes at a slower rate than the covariances themselves [32], 
Here too, it is natural to divide this stream of data into 
independent blocks of measurements. 

Our goal is to first rigorously state the problem of joint 
covariance estimation with linear structure, and derive lower 
performance bounds for the family of unbiased estimators. 
Secondly, we propose and analyze new algorithms of learning 
and exploiting the linear structure to improve the covariance 
estimation. More exactly, given a few groups of measure¬ 
ments having different second moments each, our target is 
to determine the underlying low dimensional affine subspace 
containing or approximately containing the covariance of all 
the groups. The discovered hyperplane is further used to 
improve the matrix parameter estimation. Most of the previous 
works considered particular cases of this method, e.g. factor 
models, entry-wise linear structures like in sparse and banded 
cases, or specific patterns like in Toeplitz, circulant and other 
models. Our algorithm treats the SCM of the heterogeneous 
populations as vectors in the space of matrices and is based on 
application of the principal component analysis (PCA) to learn 
their low-dimensional structure. To make the performance 
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analysis a tractable problem, we assume the data is Gaussian in 
the corresponding sections, however noting that the algorithm 
itself provides a good estimation in a much wider class of 
distributions. 

The paper is organized as following. First, we introduce the 
notations, state the problems and provide examples of affine 
structures motivating the work. Then we derive the lower 
performance bound for a class of unbiased estimators. Next 
we propose our algorithm and provide its upper performance 
bound. We conclude by numerical simulations demonstrating 
the performance advantages of the proposed technique and 
supporting our theoretical results. The Appendix section con¬ 
tains all the proofs and auxiliary statements. 

A. Notations 

Denote by S(p) the l = Pip+t) dimensional linear space 
of p x p symmetric real matrices. 1^ stands for the d x d 
identity matrix. For a matrix M its Moore-Penrose generalized 
inverse is denoted by IVC. For any two matrices M and 
P we denote by M0P their tensor (Kronecker) product. 
(7i(M) ^ ^ ay(M) ^ 0 stand for the singular values 

of a rectangular matrix M of rank not greater than r. If 
M £ S(p), we denote its eigenvalues by Ai(M) ^ ... A P (M). 
||-|| F denotes the Frobenius, and j|-|| 2 - the spectral norms 
of matrices. For any symmetric matrix S, s = vech (S) 
is a vector obtained by stacking the columns of the lower 
triangular part of S into a single column. In addition, given 
an l dimensional column vector m we denote by mat (m) the 
inverse operator constructing a p x p symmetric matrix such 
that vech (mat (m)) = m. Due to this natural linear bijection 
below we often consider subsets of S(p) as subsets of R 1 . 
In addition, let vec (S) be a p 2 dimensional vector obtained 
by stacking the columns of S, and denote by I its indices 
corresponding to the related elements of vech(S). 
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II. Problem Formulation and Examples 

Consider the heterogeneous measurements model, namely, 
assume we are given K ^ l = p ( p + 1 ' 1 groups of real p 
dimensional normal random vectors 

x},...,xj l , ..., x£,...,x£, ..., x^,...,x” . (1) 

with n i.i.d. (independent and identically distributed) samples 
inside each group and covariances 

Q fc = E[xJ.x^ T ], k = 1,..., K. 

We assume that 

QfcGP(p), k=l,...,K, (2) 

belong to an r dimensional affine subspace of S(p). Our main 
goal is to estimate this subspace and use it to improve the 
covariance matrices estimation. As a matter of application we 


will assume that r is known or not known in advance. In the 
latter case we will also explain how to determine it from the 
data. 

Let us list the most common affine covariance constraints 
naturally appearing in typical signal processing applications. 

• Diagonal: The simplest example of a structure is given 
by diagonal matrices. The covariance matrix is diagonal 
when the noise variates are independent, or can be assumed 
so with great precision. In this case the low dimensional 
subspace containing the structured matrices is r = p 
dimensional. 

• Banded: In a similar manner it is often reasonable to 
assume non-neighboring variates of the sampled vectors to 
be uncorrelated. Claiming that i-th element of the random 
vector is uncorrelated with the h -th if \i — h\ > b leads to 
6-banded covariance structure. The subspace of symmetric b- 
banded matrices constitutes an r = ( 2 P-fr)(fr+i) dimensional 
subspace inside S(p). Banded covariance matrices often 
naturally appear in time-varying moving average models or 
in spatially distributed networks, [14, 21-23], 

• Circulant: The next common type of structured covariance 
matrices are symmetric circulant matrices, defined as 


( m 1 

m 2 

m 3 . 

m p \ 

m p 


?n 2 . 

■ 1 

\rri2 

m 3 

TO 4 

TOi / 


with the natural symmetry conditions such as m p = m 2 , 
etc. Symmetric circulant matrices belong to an r = p/2 
dimensional subspace if p is even and (p + l)/2 if it is 
odd. Such matrices are typically used as approximations to 
Toeplitz matrices [9] which are associated with signals that 
obey periodic stochastic properties for example the yearly 
variation of temperature in a particular location. A special 
case of such processes are the classical stationary processes, 
which are ubiquitous in engineering, [16, 17]. 

• Toeplitz: A natural generalization of circulant are Toeplitz 
matrices. The covariance matrix appears to possess Toeplitz 
structure whenever the correlation between the i- th and the 
h- th components depend only on the the difference \i — h\. 
The dimension of a subspace of Toeplitz symmetric matrices 
is r = p. The two classical models for spectrum estimation 
utilizing the Toeplitz structures are the moving average 
(MA) and the autoregressive (AR) processes, [14, 15], 
Interestingly, the finite MA(a) process can be easily shown 
to be equivalent to a-banded Toeplitz covariance model, 
having the dimension r = a + 1. 

• Proper Complex: Many physical processes can be conve¬ 
niently described in terms of complex signals. For example, 
the most frequently appearing model of complex Gaussian 
noise is the circularly symmetric one [33]. Such noise is 
completely characterized by its mean and rotation invariant 
hermitian covariance matrix Q c . Denote centered proper 
complex distributions as 

x ~ CAf(0, Q c ). 
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The real representation of the covariance reads as 

H _l/Re(Q c ) -Im(Q c )\ 

W 2 ' v Im(Q c ) Re(Q c ) j ' 


(4) 


We see that Q n possesses a simple linear r = p 2 /4 
dimensional structure, where p is the dimension of Q R , 
which is always even. An analogous reasoning applies to 
proper quaternion covariances [34]. 

In the following it will be convenient to use a single matrix 
notation for the multiple linearly structured matrices. Set 


q k = vech(Q fe ), fc = (5) 

Y= [qi,...,q x ], (6) 


Using these notation, the prior subspace knowledge is equiv¬ 
alent to a low-rank constraint 


where we have used the following notation: 

dqfc _ [ dqfc <9qfc dch, ' 
dU <9u! du 2 ' du r 

and the formulas 

dq k _ d\5z k _ J j dqk _ dJJz k _ 

duj duj Zk l: dzk dzk 

Lemma 1. 

rank(J) = Ir + Kr — r 2 

Proof. The proof can be found in Appendix A. 
This lemma implies that J is rank deficient, as 


( 12 ) 


(13) 


(14) 

□ 


Y = UZ, (7) 

where U £ R. lxr and Z = [z 1; ... ,zk] £ K rxif . Essentially 
our problem reduces to estimation of Y given K groups of 
i.i.d. measurements , coming from the corresponding 

centered distributions with covariances Q/,.. k = 1 ,,K. 


III. Lower Performance Bounds 

Before addressing possible solutions for the above joint 
covariance estimation problem, it is instructive to examine the 
inherent performance bounds. In this section we assume the 
rank r is known and the samples are normally distributed. 
To obtain such performance bounds we use the Cramer- 
Rao Bound (CRB) to lower bound the Mean Squared Error 
(MSE) of any unbiased estimator Y of Y, defined as 


MSE = E 


_ 

2 " 

Y-Y 

F 


( 8 ) 


The MSE is bounded from below by the trace of the 
corresponding CRB matrix. To compute the CRB, for each 
i we stack the measurements xj. ; from (1) into a single vector 

(A ( 

' A/"(0, Qext), i = 1, • • • ,n, (9) 


\xj 


K/ 


where the extended covariance is given by 


rank(J) = Ir + Kr — r 2 ^ min [IK, Ir + Kr], (15) 


reflecting the fact that the parametrization of Q ext or Y by 
the pair (U, Z) is unidentifiable. Indeed for any invertible 
matrix A, the pair (UA, A~ 1 Z) fits as good. Due to this 
ambiguity the matrix FIM(U, Z) is singular and in order to 
compute the CRB we use the Moore-Penrose pseudo-inverse 
of FIM(U, Z) instead of inverse, as justified by [35]. Given 
n i.i.d. samples x% i = 1,..., n, we obtain 

CRB = — JFIM(U, Z)1 J T . (16) 

n 

For the Gaussian population the matrix FIM(U, Z) is given 
by 

FIM(U, Z) = Vdiag { [Q- 1 ® Q^ 1 ]^} J, (17) 

where [M] xx is the square submatrix of M corresponding 
to the subset of indices from X. The bound on the MSE is 
therefore given by 


MSE Si Tr (CRB) = -Tr (FIM(U, Z)1 J T J) 


= —TV 
n 


"diag {[Q fc 1 <S) Q fe 1 ] XiX } ■ 


it 


J j J 


(18) 


Qext(U, Z) = diag{Qi,..., Qa'} 

= diag{mat (Uzi),... ,mat (Uza')} • (10) 


Here the operator diag {Qi,..., Qa'} returns a block- 
diagonal matrix of size pK x pK with Q;.-s as its diagonal 
blocks. The Jacobian matrix of (9) parametrized as in (10) 
reads as 


(^Qext 

0(U, Z) 


/zf <E» I; 

z%®h 



dU 



U 0 
0 U 


c)qi 

dz± 

0 


0 

d<l2 

dz 2 


0 ^ 
0 


0 0 
: ^ 


dq k 
dz K , 


nlK x (Ir+Kr) 


(ID 


\zT®h 0 0 ... U J 


Denote 


A = min Ap(Qfc), A = max Ai(Qfc). (19) 

k k 

To get more insight on (18) we bound it from below 

MSE Si ^=-Tr ([J T J] f J T j) 

2A 2 2\ 2 

= ^^rank(J) = -^{lr + Kr — r 2 ). (20) 

n n 

The dependence on the model parameters here is similar to 
that obtained by [36] for the problem of low-rank matrix 
reconstruction. An important quantity is the marginal MSE 
per one matrix Qfc, which is proportional to 

MSE Ir — r 2 r 

— ~K~ ~ — TC - 1 -‘ ( 21 ) 

A An n 
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IV. TSVD Algorithm 
A. The Basic Algorithm 

In this section we present our algorithm for recovery of 
the true underlying covariances Qi,..., Q^-. We make use of 
the representation (7) of Y with a small change in notation 
consisting in separation of the columns mean 

Y = UZ + uo ,1]. (22) 


the thresholding approaches rely on prior knowledge about the 
power or spectral characteristics of the noise, which are known 
a priori, measured from the secondary data or can be somehow 
estimated from the samples. In our problem prior information 
is unavailable and such estimations can not be performed. 
Instead, we propose a different approach utilizing the fact that 
the noisy measurements come from Wishart populations. 

Consider the expected signal power of S/, 


This is done in order to improve the performance of the 
proposed PCA-based algorithm. Consider the SCM of the k -th 
group of measurements 

< 23 > 

i =1 

and let 

s k = vech (S fc ). (24) 

Denote by 

S = [sr,.. .,s K ], (25) 


the measurement matrix, and compute the average of the 
columns 

1 K 

u 0 = -^^s fc . (26) 

k =1 


E \\S k f F = \\Q k f F + -(TV (Q fc )) 2 . (32) 

n n 

For n large enough 

E||SJ 2 F »IIQJl(l + ^), (33) 

where c = - and 

n 


P(Qfe) = = cos- 2 (ZI, Qfc) (34) 

(Tr (Qfc)) 2 

is the spherecity coefficient, [39], measuring how close is Qfc. 
to the identity matrix. Now (33) implies that the unknown 
squared norm of the true covariance ||Qfc||^ is given by 


Consider the matrix 

S' = [si — u 0 , ...,sk~ u 0 ], (27) 

The SVD of S' reads as 

s '=°(o 

(28) 

where the singular values are sorted in the decreasing order 
and S £ R r ' x ' r . We propose to use the matrix 

Y' = UiSWf, (29) 

as an estimator of 

Y' = UZ = [UiUa] ^ [jj [WiW 2 ] T . (30) 

This approach is based on Eckart-Young theorem and we refer 
to it as Truncated SVD (TSVD) method [37]. Finally, for the 
estimator of Y we have 


IIQ*llF=E||Sdr(l + ^tL)) \ (35) 

We use the following result to estimate the unknown p( Qfc). 

Lemma 2. [39] Let p and n both tend to infinity in such a 
way that - —» c, then 

p(Sfc) —¥ p(Qfc) + c. (36) 

Based on this lemma, use p(Q k ) = p( Sfc) —c as an estimate 
of p(Qfc) and ||Sfc|| as an estimate of E ||Sfc||^ to obtain 

IIQfellF » IIS.II 2 , (l - ^y) = II^IIf - ■ 

(37) 

The ratio of the desired signal’s power || Y|| f to the power of 
measurements, E||S||^, can now be estimated by 


Y = Y' + u 0 1]. (31) 


B. How to Choose the Rank? 

In real world settings the true rank r of the structure 
subspace is rarely known in advance and one needs to estimate 
it from the data before applying the TSVD technique. It is 
instructive to think about rank estimation, followed by TSVD, 
simply as thresholding of the data singular values. A large 
variety of thresholding techniques exist, e.g. hard thresholding, 
see [38] and references therein. Unfortunately, almost none of 
them can be applied in our scenario due to two main reasons. 
First, most of them require the noise to be independent of the 
signal, which is not the case in our setting. Second, most of 


atSL-.-.S*-) 


\\nl 1 l Ef=i(Tr(S fc )) 2 

I|S|& « Ek=i l|Sfc|£ 


(38) 


This derivation suggests a simple rule of thumb for threshold¬ 
ing the spectrum. As an estimate of the rank r we take the 
number of largest singular values of S carrying the fraction a 
of the signal’s energy. 


f = argrnin 

L 


E^ 2 ( S ) 

;=i 



(39) 


Below we test this method of rank recovery in comparison to 
different approaches by numerical simulations. 
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C. TSVD Upper Performance Bound 


rems to get 


In this section we provide the performance analysis of the 
proposed TSVD algorithm in the Gaussian scenario, assuming 
r and Uq are known. We make the following natural assump¬ 
tion. 

Assumption 1. The condition number of the true matrix 
parameter Y is bounded from above by 

^ « kVF, (40) 

ov(Y) 

and the smallest singular value of Y is bounded from below 
by 

a r (Y) > e, (41) 


Y-Y 




UlSWi -UlSWi 


+ UlSWi - UlSWi 

< lls 


UlSWi - UlSWi 


Ui-Uj 

~b 

S - E 


+ aUY) 

W 1 -W 1 


F 


F 


l 


+ <U(Y)1 

1 spF- 

l|R|| 2 , 


Is- E 




< IIRII 


■(Y) 


S - s 


IIRII 


-(Y) 


^ IIRII 


V2 

<Tr{ Y) 


(o-i(Y) + 1 + ||R|| 2 ) + 1 • 


(47) 


□ 


where k and e do not depend on K, p , n or r. 

In addition, without loss of generality let u 0 = 0. 

Theorem 1. Under Assumption 1 with probability at least 

1 - 2e~ cp 


Y — Y 


< CiA (k + A) : 


,r{K + C 2 i) 


(42) 


This result quantifies the intuition that the error bound 
should depend on the intrinsic dimension r of the estimated 
subspace, rather than on the ambient dimension l. 

We now proceed to the second stage of the proof and 
develop a high-probability bound on the spectral norm of R 
in the following 

Lemma 6. with probability at least 1 — 2e _op , 


where c,C i and C 2 do not depend on or r. 

Proof Denote by R the zero mean noise matrix 



(48) 


R = S - Y. 


where C 3 does not depend on the model parameters K , p , n 
(43) or r 


The proof consists of two stages. First, using Weyl’s and Proof. The proof can be found in the Appendix B. 
Wedin’s Theorems we establish a deterministic bound on 
Y —Y depending on ||R|| 9 ,-y/r and k. On the second 
stage we use concentration of measure results to achieve high- 
probability bound on the norm ||R|| 2 , yielding the desired 
result. || Y — Y 


□ 


Plug (48) into (46) to obtain that with probability at least 

1 - 2 e _cp , 


max |< 7 j(S) - cr,-(Y)| < ||R| 


2 ’ 


< A 


Lemma 3. (Weyl, Theorem 4.11 from [40]) With the notations 
of Section IV, 


V2 


a i (Y) 



(44) 




<r r ( Y) <r r (Y) a r (Y) 

C 4 ,A(/c + A) (V2K + C 3 \Vi ) 


(49) 


Lemma 4. (Wedin, Theorem 4.4 from [40]) With the notations 
of Section IV, 


where C 4 does not depend on the model parameters and the 
last inequality is due to Assumption 1. We finally obtain that 
with probability at least 1 — 2e~ cp , 


<7 m in([U ± ] J U^), ( 7 mill (U J U) 


IIRII2 

°f (Y)' 


(45) 


Lemma 5. When r and u 0 are known, 

|y-y|| 




2 +1). 

(46) 


Proof. Use the triangle inequality, Weyl’s and Wedin’s theo- 


Y-Y 


. r ,w r . , t ,MVk + c 2 xVi) 

f CiX[k + A) - 7 =-, 

f Jn 


(50) 


and the statement follows. 


□ 


The obtained performance bound (42) suggests that the 
error is bounded with high probability by a product of a 
linear combination of K and l with r, rather than the ambient 
dimension l of the space. Compared to the bound on the MSE, 
(20), it shows that the proposed TSVD algorithm exhibits near- 
optimal dependence on the model parameters K,p,n,r. 
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V. Numerical Simulations 
A. TSVD Performance in Toeplitz Model 

For our first experiment we took a Toeplitz setting with 
p = 10, Uo = I p . The true covariances were positive definite 
matrices generated as = Uo+Sj=i z k^j / w h ere 

Dj has ones on the y-th and — j-th subdiagonals and zeros 
otherwise, r = p — 1 and zi were i.i.d. uniformly distributed 
over the interval [— |], Figure 1 shows the dependence of 

the MSE on n when K = l. In the unknown r case we 
took a power threshold for our TSVD algorithm, as defined 
in (39). For comparison we also plot the MSE-s of the SCM 
and its projection onto the known subspace structure and the 
true CRB bound given by (18). 

Figure 2 compares different thresholding techniques. As 
the benchmarks we took the Asymptotically Optimal Hard 
Thresholding (AOHT), proposed by M. Gavish and D.L. 
Donoho in [38] and the so-called “elbow method”. The AOHT 
approach consists in hard thresholding the singular values of 
a noisy low rank matrix X at a fixed level of 

T(X)= W (^)a m (X), (51) 

independent of the true rank value. Here the function uj(fi) is 
defined in [38] together with the numerical algorithm of its 
evaluation, and a rn (X) is the median of the singular values 
of X. As the authors demonstrate in [38], this approach pro¬ 
vides asymptotically optimal hard thresholding under specific 
assumptions on the noise (in particular, the noise column 
vectors are assumed i.i.d. white Gaussian). We applied the 
threshold r(X) to the original measurement S (AOHT-S in 
the figure), and its centered counterpart S' (AOHT-S-c in the 
figure). As the graph suggests, these two approaches give 
very close results and perform poorly, compared to our a- 
thresholding, when the number of measurements n in each 
group is relatively small. The “elbow method” consists in 
thresholding the spectrum of X at the level of the largest gap 
between the consecutive singular values of X. This intuitive 
rule follows the well known observation that the eigenvalues 
corresponding to the signal and noise usually group into 
clusters separated by a significant gap. This observation also 
relies on specific properties of the noise which can be violated 
in our settings. For comparison, we plot the empirical MSEs 
of the “elbow method” applied to S (Elbow-S in the figure) 
and S' (Elbow-S-c in the figure) correspondingly. 

In the third experiment we set n = 100 fixed and explored 
the dependence of the MSE on the number of groups I\ in 
the same setting as before. Figure 3 verifies that the marginal 
MSE depends on K as predicted by formula (21). 


B. TSVD Time-tracking 

For the second experiment we considered the problem of 
tracking a time-varying covariance in complex populations. 
We used the Data Generating Process (DGP) of Patton and 
Sheppard, [41], which allows for dynamically changing co- 
variances in the spirit of a multivariate GARCH-type model. 



Fig. 1. TSVD algorithm performance. 



Fig. 2. TSVD algorithm different thresholding methods. 


[42, 43], One of variations of this DGP suggests the following 
data model: 


x t = H t 1/2 y t) f = 1,... A'n, (52) 


where we assumed the generating data to be proper complex, 
y t ~ CAf(0 ,1) and defined the hermitian time-varying covari¬ 
ance H f to change according to the law 


H t 

H 4 


(l-^Ht-i+jSMtMf, 



t = 1,... Jin, 


(53) 

(54) 
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0.01 and used 90%-power threshold to discover the underlying 
low-dimensional structure. Figure 4 compares the temporal 
behavior of the MSE-s of the SCM, TSVD applied to it and 
the projection of SCM on the subspace spanned by proper 
covariances. The MSE-s were obtained by averaging the 
squared errors over 10000 iterations. 

VI. Conclusion 

In this paper we consider the problem of joint covariance 
estimation with linear structure, given heterogeneous measure¬ 
ments. The main challenge in this scenario is twofold. At 
first, the underlying structure is to be discovered, and then 
utilized to improve the covariance estimation. We propose 
a PCA-based algorithm in the space of matrices, solving 
the both tasks simultaneously. In addition, we analyze the 
performance bounds of this algorithm and show it is close to 
being optimal. Finally, we demonstrate its advantages through 
numerical simulations. 


Fig. 3. Marginal TSVD algorithm performance, n = 100. 


Appendix A 



Fig. 4. TSVD algorithm learning the low-dimensional subspace with time. 


Here M t are random p x p matrices with i.i.d. standard 
normally distributed entries. Ho is arbitrary positive-definite 
hermitian and /3 £ [0,1]. 

The low-dimensional structure appearing in this setting is 
due to properness of the covariances (see (4)). In order to 
explore it, the obtained complex data was represented as 
double-sized real measurements. Each n clock ticks we formed 
the SCM S± of the last n measurements, where t was the last 

time count. Then we concatenated the vector vech to 

the matrix Y i £ R ,x (n _1 ) of growing size to obtain Yi 
and applied our TSVD algorithm to it. Thus, our structure 
knowledge was updated every n ticks, and we expected the 
error of the covariance estimation to decrease with time. We 
performed the experiment with p = 4, K = 100, n = 30, /3 = 


Proof of Lemma 1. Let us calculate the column rank of J. 
Complete the columns of U to an orthonormal basis U of 
M and multiply J on the left by an invertible square matrix 


Ik U to obtain 

/zf <g> U 

I/xr 

0 

. 0 \ 

J = ® u) J = 

z^®U 

0 

I Ixr 

0 


W K <g> u 

0 

0 . 

I Ixr/ 


(55) 

where I i xr denotes the first r columns of I;. Let us look at 
the first l rows of J 

Ji = (z^®U I Ur 0 ... o). (56) 

By subtracting form the first l columns necessary linear 
combinations of I/ Xr this matrix can be brought to the form 

J'i = (z[ 0 U' I lxr 0 ... o), (57) 


where the first r rows of U' are zero. Performing the same 
procedure for all the row submatrices J / c , k = 1..... A' yields 



/ zf ® U' 

Ilxr 

0 

. o \ 

j' = 

■ l t 2 ® U' 

0 

I/xr 

0 


VzJ- <g> U' 

0 

0 

I Ixr/ 


(58) 


of the same rank as J. The first l columns of J' read as 



Wiuy VW 


(59) 


and therefore, 

rank(J' 1 ) = rank(U , ) = l — r. 


(60) 
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Performing the same operation on the column submatrices 

J°, 3 = !> •••,»* 8 ives 

rank(J) = rank(J') = r(l — r) + Kr = Ir + Kr — r 2 . (61) 

□ 


Lemma 9. [47] Let W = - Y^i= l w i w f’ where w, are i.i.d. 
copies of w ~ AA(0,1), then 


|W 


I|| 2 > 



(70) 


Appendix B 

The following definitions and basic lemmas from the theory 
of Orlicz spaces can be found in such classical references as 

[44] , or in a more convenient form in recent works such as 

[45] , 


Definition 1. (Subexponential variable and its Orlicz norm) 
Let x be a nonnegative random variable such that 


P(x ^ p + t) ^ 


ex p (-£*) > 
ex p (-•&)» 


t G 


0 — 

u ’ b 



(62) 


for some p, a and b, then we say that x is subexponential with 
parameters ( o 2 ,b ). The Orlicz norm of x is defined as 


ll*lk = ^>o{ Eexp (^) <2 }' (63) 

The intuition behind the subexponential norm basically says 
that if 11 x 11is finite, the |x|| ( , :)| -scaled tails of x are not 
heavier than that of a standard exponential variable. 


Lemma 7. A random variable x is subexponential with 
parameters (a 2 ,b) iff 

E(exp(Ax)) ^ exp > v l A l < (64) 

Corollary 1. The Orlicz norm of a subexponential variable x 
with parameters (cr 2 ,b) is bounded by 

IMU < max b) , (65) 

Proof Lemma 7 implies 

Eexp ^ — ^ < exp , Vk > b. (66) 

Bound the right hand side of the last inequality by 2 to get 
the statement. □ 


Next we use the following tail bound on a y 2 variable to 
calculate the Orlicz norm of its centered version. 

Lemma 8. [46] Let ( ~ y- 2 (p), then 

P (C — p ^ 2\/pt + 2 1) < e _t , (67) 

P (p - C > 2 v / pf) < e _t . (68) 

Corollary 2. Let f ~ X 2 (p)> then the Orlicz norm of \( — p\ 
is bounded by 

IIIC-pIIU^ Vp- (69) 

Proof Follows by a straight forward calculation from Lemma 
8 and Corollary 1. □ 

In the course of proof of Theorem 1 we also use the 
following tail bound on the spectral norm deviations of a 
standard Wishart matrix from its mean. 


Theorem 2 (Theorem 1 from [48]). Let G E-P 

be independent random vectors (not necessarily identically 
distributed). Assume that there exists f>, such that 




(71) 


where S l 1 C is a unit sphere, and there exists K, > 1, 
such that 


P [ -^-= max llrfell ^ AEmax 



then with probability at least 1 — 2e cv ^ 




a -Vi 


(72) 


l|R|| 2 < ||E[RR T ]||2 /2 + C(tA + /C)y/J, (73) 

where R = [ri,..., r k] and c, C are universal constants. 

Proof of Lemma 6. To make the calculations easier in this 
proof we assume that the columns of R, 


R=[ri,...,r*], (74) 


are constructed as r/, : = vec (S/ ;: — Q/, ; ) and therefore, R G 
R p xK . This may only affect the numerical constants by the 
magnitude up to 2. In order to apply Theorem 2 we need to 
make sure the conditions (71) and (72) hold. For this purpose 
fix a matrix P and consider the univariate variable |(rj,,p)|, 
where p = vec (P), for some fixed k. Note that 


( r fe, P) = Tr ((S* - Qfc)P) = ^ xf Px’ - Tr (Q fc P), 


(75) 

and among all the matrices P of norm one, the Orlicz norm 
of the right-hand side of (75) is maximized by 


Pi = 


Qfc 


-1 


Q 


(76) 


k II F 


For this choice of P;,. we obtain 

|Tr ((Sfc — Qfc)P) | = 


^M\\ ~P\ slC-nPl 


Qk 1 


k II F 


Q 


k IIf 


(77) 

where £ ~ X 2 ( n P ) and, thus, 

lllm , , to n^Mi, / 2 V™P / 2A t(Q k) 

iipT-. I " 1 ' ((St ■ Q * )P) "L s yiqpr « -XT' 

(78) 

(79) 


where we have used the inequality 


|Q 


-11 




Vp 


ik llf " Ar(Qfe)' 

The bound (78) finally yields 


2A 


max max |||Tr((S fc - Q fe )P) 11 |. 01 < - 7 =, (80) 
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and therefore the (71) holds true for 


/ -™L 
0 VTi ' 


(81) 


In order to verify the boundedness condition (72), consider the 
variable 


Ikfcll ||Sfc - Qfe|| F A Tl 

max —■= = max- ^ — max IIW — 1 

fc fc p y/p k " 


2 > 


where W). = Q fc 1 ^ 2 S k Q k 1/0 . Denote 


5 = max 


and use Lemma 9 to obtain 


1 , 


1/4' 


(82) 


(83) 


1 


p fc 


A 


- max ||r J| ^ ICS < P —— max ||Wfc — I|| 2 ^ ICS 


Vp 


IIW.-IILaO^ 


ks€- 


^ 2 K exp — - 


(84) 


Set 


to get 


K= -A-(V2(l + ln2ir/p) + l) 

\/nd 


- max IIr fc II ^ ICS ^ e p , 

p k 


(85) 


( 86 ) 


thus the conditions of Theorem 1 from [48] are satisfied with 
the constants <fi and 1C. Let us bound the spectral norm of 
E[RR t |. For this purpose note that 


where 


e[rr t ]|| 2 < a ||e[mm t ]|| 0 , 


M= [mi,.,., m K ] 


(87) 


( 88 ) 


and rn A . = vec (W fc - I), with W k = A i wfwf T , and 
w( : ~ Af( 0, 1 ), i = 1,... ,n, k = 1,... ,K. Note that for 
C ~X 2 (n), 

E[(C — n ) 2 ] = 2n, 

and for 7 w, ~ A/"(0,1), i = 1,..., n, i.i.d.. 


(89) 


E 


Ki=1 


o; 7 - 


= n, 


to obtain 


2nl v 0 


E[MM ] = — I 0 nli- p nl z _, 

0 nli- p nli- p 


n* 


(90) 


' xp , (91) 


where for convenience we have ordered the elements of 
vec (A) in such a way that the diagonal of A goes first, then 
the upper triangular part and then the lower triangular part. 
We, therefore, obtain. 


E[RR r ] 


2A 2 


(92) 


Finally we obtain that with probability at least 1 — 2e cp , 

||R|| 2 < Ai/ 2 +c 2 (0 + /c) 

\ n 

T /2 , C- 2 A , 1 / 2(1 + In 2K/p) + 1 ^ [f 

- X \n + Jn{ 2+ 5 ) V K 

where we have used the fact that 

i/2(l + hi2A7p) + l _ i/2(l + In 2IC/p) + 1 


5 

max 

i, (§) 1/4 ' 


5 \p J 


is bounded. Now replace p 2 back by l, which can at most affect 
the constants by a factor up to 2 , to get the statement. □ 
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